Show the code
<<<<<<< HEADknitr::opts_chunk$set(tidy = FALSE,fig.show='hold',fig.align='center')
=======
knitr::opts_chunk$set(tidy = FALSE, out.width="49%", out.height="20%",fig.show='hold',fig.align='center')
>>>>>>> 2a872009eea3ee9cb7110dfa63b05c23942bec27
options(htmltools.dir.version = FALSE)
library(ggplot2)
library(simone)
library(SpiecEasi)
library(huge)
library(Matrix)
library(ghibli)
1 Introduction
Probabilistic graphical models (Lauritzen 1996; Koller and Friedman 2009) are widely used in high-dimensional data analysis to synthesize the interaction between variables. In many applications, such as genomics or image analysis, graphical models reduce the number of parameters by selecting the most relevant interactions between variables. Undirected Gaussian Graphical Models (GGMs) are a class of graphical models used in Gaussian settings. In the context of high-dimensional statistics, graphical models are generally assumed sparse, meaning that a small number of variables interact, compared to the total number of possible interactions. This assumption has been shown to provide both statistical and computational advantages by simplifying the structure of dependence between variables (Dempster 1972) and allowing efficient algorithms (Meinshausen and Bühlmann 2006). See, for instance, (Fan, Liao, and Liu 2016) for a review about sparse graphical models inference.
In GGMs, it is well known (see, e.g., (Lauritzen 1996)) that inferring the graphical model or equivalently the Conditional Independence Graph (CIG) boils down to inferring the support of the precision matrix \mathbf{\Omega} (the inverse of the variance-covariance matrix). To do so, several \ell_1 penalized methods have been proposed in the literature to learn the CIG of GGMs. For instance, the neighborhood selection (Meinshausen and Bühlmann 2006) (MB) based on a nodewise regression approach via the least absolute shrinkage and selection operator (Lasso, (R. Tibshirani 1996)) is a popular method. Each variable is regressed on the others, taking advantage of the link between the so-obtained regression coefficients and partial correlations. More precisely, for all 1 \le i \le p, the following problem is solved:
\hat{\boldsymbol{\beta}^i}(\lambda) = \underset{\boldsymbol{\beta}^i \in \mathbb{R}^{p-1}}{\operatorname{argmin}} \frac{1}{n} \left \lVert \mathbf{X}^i - \mathbf{X}^{\setminus i} \boldsymbol{\beta}^i \right \rVert_2 ^2 + \lambda \left \lVert \boldsymbol{\beta}^i \right \rVert_1. \tag{1}
In Equation 1, \lambda is a non negative regularization parameter and \mathbf{X}^{\setminus i} denotes the matrix \mathbf{X} deprived of column i. The MB method defined by the estimation problem
Another family of sparse CIG inference methods directly estimates \mathbf{\Omega} via direct minimization of the \ell_1-penalized negative log-likelihood (Banerjee, El Ghaoui, and d’Aspremont 2008), without resorting to the auxiliary regression problem. This method, called the graphical lasso (Friedman, Hastie, and Tibshirani 2007), benefits from many optimization algorithms (Yuan and Lin 2007; Rothman et al. 2008; Banerjee, El Ghaoui, and d’Aspremont 2008; Hsieh et al. 2014).
Such inference methods are widely used and enjoy many favorable theoretical and empirical properties, including robustness to high-dimensional problems. However, some limitations have been observed, particularly in the presence of strongly correlated variables. These limitations are caused by known impairments of Lasso-type regularization in this context (Bühlmann et al. 2012; Park, Hastie, and Tibshirani 2006). To overcome this, in addition to sparsity, several previous works attempt to estimate CIG by integrating clustering structures among variables for the sake of both statistical sanity and interpretability. A non-exhaustive list of works that integrate a clustering structure to speed up or improve the estimation procedure includes (Honorio et al. 2009; Ambroise, Chiquet, and Matias 2009; Mazumder and Hastie 2012; Tan, Witten, and Shojaie 2013; Yao and Allen 2019; Devijver and Gallopin 2018).
The above methods exploit the group structure to simplify the graph inference problem and infer the CIG between single variables. Another question that has received less attention is the inference of the CIG between the groups of variables, i.e., between the meta-variables representative of the group structure. A recent work introducing inference of graphical models on multiple grouping levels is (Cheng, Shan, and Kim 2017). They proposed inferring the CIG of gene data on two levels corresponding to genes and pathways, respectively. Note that pathways are groups of functionally related genes known in advance. The inference is achieved by optimizing a penalized maximum likelihood that estimates a sparse network at both gene and group levels. Our work is also part of this dynamic. We introduce a penalty term allowing parsimonious networks to be built at different hierarchical clustering levels. The main difference with the procedure of (Cheng, Shan, and Kim 2017) is that we do not require prior knowledge of the group structure, which makes the problem significantly more complex. In addition, our method has the advantage of proposing CIGs at more than two levels of granularity.
We introduce the Multiscale Graphical Lasso (MGLasso), a novel method to estimate simultaneously a hierarchical clustering structure, and graphical models depicting the conditional independence structure between clusters of variables at each level of the hierarchy.
In the spirit of convex hierarchical clustering, introduced by Hocking et al. (2011), Lindsten, Ohlsson, and Ljung (2011), the hierarchy is obtained by spanning the entire regularization path.
\frac{1}{2} \sum_{i=1}^n \left \lVert \boldsymbol x_i - \boldsymbol \alpha_i \right \rVert_2^2 + \lambda \sum_{i < j} w_{ij} \left \lVert \boldsymbol \alpha_i - \boldsymbol \alpha_j \right \rVert_q \tag{2}
So, the key novelty of the MGLasso is the adaptation of the global likelihood maximization problem to the neighborhood selection framework (Meinshausen and Bühlmann 2006) while using fusion penalties via the maximization of a pseudo-likelihood function. Our fusion penalty differs from the ones recensed in the literature. We enforce fusion at the scale of the whole regression vector while permuting symmetric regression coefficients. Namely we use the general fusion term \sum_{i < j} \left \lVert \boldsymbol{\beta}^i - \tau_{ij}(\boldsymbol{\beta}^j) \right \rVert_2 where \tau_{ij} is a permutation function on symmetric coefficients. Morover we use a conesta …
======= <<<<<<< HEADSo, the key novelty of the MGLasso is the adaptation the global likelihood maximization problem to the neighborhood selection framework (Meinshausen and Bühlmann 2006) via the maximization of a pseudo-likelihood function. Moreover our fusion penalty differs from the ones recensed in the literature. We enforce fusion at the scale of the whole regression vector while permuting symmetric regression coefficients. Namely we use the general fusion term \sum_{i < j} \left \lVert \boldsymbol{\beta}^i - \tau_{ij}(\boldsymbol{\beta}^j) \right \rVert_2 where \tau_{ij} is a permutation function on symmetric coefficients.
>>>>>>> ef87e32d4a32c88fad3b530b9a141e7d01039c9bThe remainder of this paper is organized as follows. In section Multiscale Graphical Lasso, we formally introduce the Multiscale Graphical Lasso based on a convex estimation problem and an optimization algorithm based on the continuation of Nesterov’s smoothing technique (Hadj-Selem et al. 2018). Section Simulation experiments presents numerical results on simulated and real data.
2 Multiscale Graphical Lasso
Let X = (X^1, \dots, X^p)^T \in \mathbb R^p be a p-dimensional Gaussian random vector, with mean vector \mu and covariance matrix \mathbf \Sigma. The conditional independence structure of X is characterized by a graph G = (V, E), where V = \{1,\ldots p\} is the set of variables and E the set of edges, uniquely determined by the support of the precision matrix \mathbf{\Omega} = \mathbf{S}^{-1} (see, e.g., (Dempster 1972)). In other words, for any two vertices i,j\in V, the edge (i,j) belongs to the set E if and only if \Omega_{ij} \neq 0, that is if and only if the i-th and j-th variables are conditionally independent given all the others i.e. X^i \perp \!\!\! \perp X^j |X^{\setminus (i, j)} where X^{\setminus (i, j)} is the set of all p variables deprived of variables i and j.
Considering the linear model X^i = \sum_{j\neq i} \beta^i_j X_j + \epsilon_i where \epsilon_i is a Gaussian centered random variable, we have \beta^i_j = -\frac{\Omega_{ij}}{\Omega_{ii}}. We define the regression matrix \boldsymbol{\beta} := [{\boldsymbol{\beta}^1}^T, \ldots, {\boldsymbol{\beta}^p}^T]^T \in \mathbb{R}^{p \times (p-1)}, whose rows are the regression vectors for each of the p regressions.
Let the n \times p-dimensional matrix \mathbf{X} contain n independent observations of X. We propose to minimize the following criterion which combines Lasso and group-fused Lasso penalties:
J_{\lambda_1, \lambda_2}(\boldsymbol{\beta}; \mathbf{X} ) = \frac{1}{2} \sum_{i=1}^p \left \lVert \mathbf{X}^i - \mathbf{X}^{\setminus i} \boldsymbol{\beta}^i \right \rVert_2 ^2 + \lambda_1 \sum_{i = 1}^p \left \lVert \boldsymbol{\beta}^i \right \rVert_1 + \lambda_2 \sum_{i < j} \left \lVert \boldsymbol{\beta}^i - \tau_{ij}(\boldsymbol{\beta}^j) \right \rVert_2, \tag{3}
where \tau_{ij} is a permutation exchanging the coefficients \boldsymbol{\beta}^j_j and \boldsymbol{\beta}^j_i and leaves other coefficients untouched, \mathbf{X}^{i}\in \mathbb{R}^n denotes the i-th column of \mathbf{X}, \boldsymbol{\beta}_{i} denotes the i-th row of \beta, \lambda_1 and \lambda_2 are penalization parameters. Let us consider
\hat{\boldsymbol{\beta}} \in \underset{\boldsymbol{\beta}}{\operatorname{argmin}} J_{\lambda_1, \lambda_2}(\boldsymbol{\beta}, \mathbf{X}).
The lasso penalty term encourages sparsity and the penalty term \left \lVert \boldsymbol{\beta}^i - \tau_{ij}(\boldsymbol{\beta}^j) \right \rVert_2 encourages to fuse regression vectors \boldsymbol{\beta}^i and \boldsymbol{\beta}^j. These fusions uncover a clustering structure. The model is likely to cluster together variables that have the same conditional effects on the others. Variables X^i and X^j are assigned to the same cluster when \boldsymbol{\beta}^i = \tau_{ij}(\boldsymbol{\beta}^j).
Let us illustrate by an example the effect of the proposed approach. If we consider a group of q variables whose regression vectors have at least q non-zero coefficients and further assume that for each pair of group variables i and j, \|\boldsymbol{\beta}^i - \tau_{ij} (\boldsymbol{\beta}^j)\|_2=0. After some permutations, we get a q\times q block of non-zeros coefficient \beta_{ij} corresponding to the group in the \boldsymbol{\beta} matrix, where (i,j)\in \{1,\cdots, q\}^2. If we consider three different indices i,j,k \ \in \{1,\cdots, q\}^3, it is straightforward to show that the six coefficients indexed by (i,j,k) are all equal. Thus the distance constraints between vectors \boldsymbol{\beta}^i of a group forces equality of all regression coefficients in the group.
The greater the regularization weight \lambda_2, the larger groups become. This is the core principle of the convex relaxation of hierarchical clustering introduced by Hocking et al. (2011). Hence, we can derive a hierarchical clustering structure by spanning the regularization path obtained by varying \lambda_2 while \lambda_1 is fixed. The addition of a fused-type term in graphical models inference has been studied previously by authors such as Honorio et al. (2009), Ganguly and Polonik (2014), Grechkin et al. (2015). However, these existing methods require prior knowledge of the neighborhood of each variable. On the contrary, our approach allows simultaneous inference of a multi-level graphical model and a hierarchical clustering of the variables.
In practice, if some information about the clustering structure is available, the problem can be generalized into:
\min_{\boldsymbol{\beta}} \sum_{i=1}^p\frac{1}{2} \left \lVert \mathbf{X}^i - \mathbf{X}^{\setminus i} \boldsymbol{\beta}^i \right \rVert_2 ^2 + \lambda_1 \sum_{i = 1}^p \left \lVert \boldsymbol{\beta}^i \right \rVert_1 + \lambda_2 \sum_{i < j} w_{ij} \left \lVert \boldsymbol{\beta}^i - \tau_{ij}(\boldsymbol{\beta}^j) \right \rVert_2, \tag{4}
where w_{ij} is a positive weight encoding prior knowledge of the groups to which variables i and j belong to.
3 Numerical scheme
This section introduces a complete numerical scheme to apply MGLasso in practice, using a convex optimization algorithm and a model selection procedure. Section Optimization via CONESTA algorithm reviews the principles of the Continuation with Nesterov smoothing in a shrinkage-thresholding algorithm (CONESTA, (Hadj-Selem et al. 2018)), the optimization algorithm used in practice to solve MGLasso. Section Reformulation of MGLasso for CONESTA algorithm details a reformulation of MGLasso, which enables us to apply CONESTA. Finally, Section Model selection presents the procedure used to select the regularization parameters.
3.1 Optimization via CONESTA algorithm
The optimization problem for Multiscale Graphical Lasso is convex but not straightforward to solve using classical algorithms because of the fused-lasso type penalty, which is non-separable and admits no closed-form solution for the proximal gradient. We rely on the Continuation with Nesterov smoothing in a shrinkage-thresholding algorithm (CONESTA, Hadj-Selem et al. (2018)), dedicated to high-dimensional regression problems with structured sparsity such as group structures.
The CONESTA solver, initially introduced for neuro-imaging problems, addresses a general class of convex optimization problems which includes group-wise penalties.
\operatorname{minimize} \quad f(\boldsymbol{\theta}) = g(\boldsymbol{\theta}) + \lambda_1 h(\boldsymbol{\theta}) + \lambda_2 s(\boldsymbol{\theta}), \tag{5}
In the original paper (Hadj-Selem et al. 2018),
s(\boldsymbol{\theta}) = \sum_{\phi \in \Phi} \|\mathbf{A}_\phi \boldsymbol{\theta}_\phi\|_2.
s_{\mu}(\boldsymbol{\theta}) = \max_{\boldsymbol{\alpha} \in \mathcal{K}} \left \{ \boldsymbol{\alpha}^T \mathbf{A} \boldsymbol{\theta} - \frac{\mu}{2} \| \boldsymbol{\alpha} \|_2^2 \right \},
where \mathcal{K} is
of the matrices \mathbf{A}_\phi and \boldsymbol{\alpha} is an auxiliary variable resulting from the dual reformulation of s(\boldsymbol{\theta}) . Note that \lim_{\mu \rightarrow 0} s_{\mu}(\boldsymbol{\theta}) = s(\boldsymbol{\theta}).
\operatorname{GAP}_{\mu^k}(\boldsymbol{\theta}^k) \ge f_{\mu^k}(\boldsymbol{\theta}^k) - f(\boldsymbol{\theta}^\star) \ge 0.
We refer the reader to the seminal paper for more details on the formulation of \operatorname{GAP}_{\mu^k}(\boldsymbol{\theta}^k). The CONESTA routine is spelled out in the algorithm CONESTA solver where L(g + \lambda_2 s_{\mu}) is the Lipschitz constant of \nabla(g + \lambda_2 s_{\mu}).
\begin{algorithm}
\caption{CONESTA solver}
\begin{algorithmic}
\STATE \textbf{Inputs}: \\
$\quad$ functions $g(\boldsymbol{\theta}), h(\boldsymbol{\theta}), s(\boldsymbol{\theta})$ \\
$\quad$ precision $\epsilon$ \\
$\quad$ penalty parameters $\lambda_1, \lambda_2$ \\
$\quad$ decreasing factor $\tau \in (0,1)$ for sequence of precisions
\STATE \textbf{Output:} \\
$\quad$ $\boldsymbol{\theta}^{i+1} \in \mathbb{R}^d$
\STATE \textbf{Initializations:} \\
$\quad \boldsymbol{\theta}^0 \in \mathbb{R}^d$ \\
$\quad \epsilon^0 = \tau \operatorname{GAP}_{\mu = 10^{-8}}(\boldsymbol{\theta}^0)$ \\
$\quad \mu^0 = \mu_{opt}(\epsilon^0)$
\Repeat
\STATE $\epsilon^i_{\mu} = \epsilon^i - \mu^i \lambda_2 \frac{d}{2}$ \\
\COMMENT{FISTA}
\STATE $k=2$ \COMMENT{new iterator}
\STATE $\boldsymbol{\theta}_{\operatorname{FISTA}}^1 = \boldsymbol{\theta}_{\operatorname{FISTA}}^0 = \boldsymbol{\theta}^i$ \COMMENT{Initial parameters value}
\STATE $t_{\mu} = \frac{1}{L(g + \lambda_2 s_{\mu})}$ \COMMENT{Compute stepsize}
\Repeat
\STATE $\boldsymbol{z} = \boldsymbol{\theta}_{\operatorname{FISTA}}^{k-1} + \frac{k-2}{k+1}(\boldsymbol{\theta}_{\operatorname{FISTA}}^{k-1} - \boldsymbol{\theta}_{\operatorname{FISTA}}^{k-2})$
\STATE $\boldsymbol{\theta}_{\operatorname{FISTA}}^k = \operatorname{prox}_{\lambda_1 h}(\boldsymbol{z} - t_{\mu} \nabla(g + \lambda_2 s_{\mu})(\boldsymbol{z}))$
\Until{$\operatorname{GAP}_{\mu}(\boldsymbol{\theta}_{\operatorname{FISTA}}^k) \le \epsilon_{\mu}^i$}
\STATE $\boldsymbol{\theta}^{i+1} = \boldsymbol{\theta}_{\operatorname{FISTA}}^k$ \\
\STATE $\epsilon^i = \operatorname{GAP}_{\mu = \mu_i} \boldsymbol{\theta}^{i+1} + \mu^i \lambda_2 \frac{d}{2}$ \\
\STATE $\epsilon^{i+1} = \tau \epsilon^{i}$ \\
\STATE $\mu^{i+1} = \mu_{opt}(\epsilon^{i+1})$
\Until{$\epsilon^i \le \epsilon$}
\end{algorithmic}
\end{algorithm}
3.2 Reformulation of MGLasso for CONESTA algorithm
To apply CONESTA, it is necessary to reformulate the MGLasso problem in order to comply with the form of loss function required by CONESTA. The objective of MGLasso can indeed be written as
\operatorname{argmin} \frac{1}{2} ||\mathbf{Y} - \tilde{\mathbf{X}} \tilde{\boldsymbol{\beta}}||_2^2 + \lambda_1 ||\tilde{\boldsymbol{\beta}}||_1 + \lambda_2 \sum_{i<j} ||\boldsymbol D_{ij} \tilde{\boldsymbol{\beta}}||_2, \tag{6}
<<<<<<< HEADwhere \mathbf{Y} = \operatorname{Vec}(\mathbf{X}) \in \mathbb{R}^{np}, \tilde{\boldsymbol{\beta}} = \operatorname{Vec(\boldsymbol{\beta})} \in \mathbb{R}^{p (p-1)}, \tilde{\mathbf{X}} is a \mathbb{R}^{[np]\times [p \times (p-1)]} block-diagonal matrix with \mathbf{X}^{\setminus i} on the i-th block. The matrix \boldsymbol D_{ij} is a (p-1)\times p(p-1) matrix
where \mathbf{Y} = \operatorname{Vec}(\mathbf{X}) \in \mathbb{R}^{np}, \tilde{\boldsymbol{\beta}} = \operatorname{Vec(\boldsymbol{\beta})} \in \mathbb{R}^{p (p-1)}, \tilde{\mathbf{X}} is a \mathbb{R}^{[np]\times [p \times (p-1)]} block-diagonal matrix with \mathbf{X}^{\setminus i} on the i-th block. The matrix \boldsymbol D_{ij} is a (p-1)\times p(p-1) matrix
where \mathbf{Y} = \operatorname{Vec}(\mathbf{X}) \in \mathbb{R}^{np}, \tilde{\boldsymbol{\beta}} = \operatorname{Vec(\boldsymbol{\beta})} \in \mathbb{R}^{p (p-1)}, \tilde{\mathbf{X}} is a \mathbb{R}^{[np]\times [p \times (p-1)]} block-diagonal matrix with \mathbf{X}^{\setminus i} on the i-th block. The matrix \boldsymbol A_{ij} is a (p-1)\times p(p-1) matrix
Note that we introduce this notation for simplicity of exposition, but, in practice, the sparsity of the matrices \boldsymbol D_{ij} allows a more efficient implementation. Based on reformulation Equation 6, we may apply CONESTA to solve the objective of MGLasso for fixed \lambda_1 and \lambda_2. The procedure is applied, for fixed \lambda_1, to a range of decreasing values of \lambda_2 to obtain a hierarchical clustering. The corresponding pseudo-code is given in the following algorithm where (\mathbf{X}^i)^{\dagger} denotes the pseudo-inverse of \mathbf{X}^i and \epsilon_{fuse} the threshold for merging clusters.
\begin{algorithm}
\caption{MGLasso algorithm}
\begin{algorithmic}
\STATE \textbf{Inputs}: \\
$\quad$ Set of variables $\mathbf{X} = \{\mathbf{X}^1, \dots, \mathbf{X}^p \} \in \mathbb R^{n\times p}$ \\
$\quad$ Penalty parameters $\lambda_1 \ge 0, {\lambda_2}_{\operatorname{initial}} > 0$ \\
$\quad$ Increasing factor $\eta > 1$ for fusion penalties $\lambda_2$\\
$\quad$ Fusion threshold $\epsilon_{fuse} \ge 0$
\STATE \textbf{Outputs:} For $\lambda_1$ fixed and $\lambda_2$ from $0$ to ${\lambda_2}_{\operatorname{initial}} \times \eta^{(I)}$ with $I$ the number of iterations: \\
$\quad$ Regression vectors $\boldsymbol{\beta}(\lambda_1, \lambda_2) \in \mathbb R^{p \times (p-1)}$, \\
$\quad$ Clusters of variables indices $C(\lambda_1, \lambda_2) =\{ C_k \}_{(\lambda_1, \lambda_2)}, k=1, \dots, K$
\STATE \textbf{Initializations:} \\
$\quad$ $\boldsymbol{\beta}^i = (\mathbf{X}^i)^{\dagger}\mathbf{X}^i$, $\forall i = 1, \dots, p$ for warm start in CONESTA solver \\
$\quad$ $C = \left \{\{1\}, \dots, \{p\}\right \}$ Initial clusters with one element per cluster. \\
$\quad$ Set $\lambda_2 = 0$ \\
$\quad$ Compute $\boldsymbol{\beta}$ using CONESTA solver (Equivalent to Meinshausen-Bühlmann approach) \\
$\quad$ Update clusters $C$ with rule described in \textbf{while} loop.
\STATE \textbf{Set:} $\lambda_2 = {\lambda_2}_{\operatorname{initial}}$ \\
\COMMENT{Clustering path}
\WHILE{$\operatorname{Card}(C) > 1$}
\STATE Compute $\boldsymbol{\beta}$ using CONESTA solver with warm start from previous iteration \\
\COMMENT{Clusters update}
\STATE Compute pairwises distances $d(i,j)=\left \lVert \boldsymbol{\beta}^i - \tau_{ij}(\boldsymbol{\beta}^j) \right \rVert_2$, $\forall i,j \in \{1, \dots, p\}$ \\
\STATE Determine clusters $C_k (k=1, \dots, K)$ with the rule $(i,j) \in C_k$ iff. $d(i,j) \le \epsilon_{fuse}$
\STATE $\lambda_2 = \lambda_2 \times \kappa$
\ENDWHILE
\end{algorithmic}
\end{algorithm}
3.3 Model selection
A crucial question for practical applications is the definition of a rule to select the penalty parameters (\lambda_1, \lambda_2). This selection problem operates at two levels: \lambda_1 controls the sparsity of the graphical model, and \lambda_2 controls the number of clusters in the optimal clustering partition. These two parameters are dealt with separately: the sparsity parameter \lambda_1 is chosen via model selection, while the clustering parameter \lambda_2 varies across a grid of values, in order to obtain graphs with different levels of granularity. The problem of model selection in graphical models is difficult in the high dimensional case where the number of samples is small compared to the number of variables, as classical AIC and BIC criteria tend to perform poorly (Liu, Roeder, and Wasserman 2010). Alternative criteria have been proposed in the literature, such as cross-validation (Bien and Tibshirani 2011), GGMSelect (Giraud, Huet, and Verzelen 2012), stability selection (Meinshausen and Bühlmann 2010; Liu, Roeder, and Wasserman 2010), Extended Bayesian Information Criterion (EBIC) (Foygel and Drton 2010), and Rotation Information Criterion (Zhao et al. 2012).
In this paper, we focused on the StARS stability selection approach proposed by Liu, Roeder, and Wasserman (2010). The method uses k subsamples of data to estimate the associated graphs for a given range of \lambda_1 values. For each value, a global instability of the graph edges is computed. The optimal value of \lambda_1 is chosen so as to minimize the instability, as follows. Let \lambda^{(1)}_1, \dots, \lambda_1^{(K)} be a grid of sparsity regularization parameters, and S_1, \dots, S_N be N bootstrap samples obtained by sampling the rows of the data set \mathbf{X}. For each k\in\{1,\ldots,K\} and for each j\in\{1,\ldots, N\}, we denote by \mathcal{A}^{k,j}(\mathbf{X}) the adjacency matrix of the estimated graph obtained by applying the inference algorithm to S_n with regularization parameter \lambda_1^{(k)}. For each possible edge (s,t)\in\{1,\ldots,p\}^2, the probability of edge appearance is estimated empirically by
\hat \theta_{st}^{(k)} = \frac{1}{N} \sum_{j=1}^N \mathcal{A}^{k,j}_{st}.
Define
\hat \xi_{st}(\Lambda) = 2 \hat \theta_{st} (\Lambda) \left ( 1 - \hat \theta_{st} (\Lambda) \right )
the empirical instability of edge (s,t) (that is, twice the variance of the Bernoulli indicator of edge (s,t)). The instability level associated to \lambda_1^{(k)} is given by
\hat D(\lambda_1^{(k)}) = \frac{\sum_{s<t} \hat \xi_{st}(\lambda_1^{(k)})}{p \choose 2}, StARS selects the optimal penalty parameter as follows
\hat \lambda = \max_k\left\{ \lambda_1^{(k)}: \hat D(\lambda_1^{(k)}) \le \beta, k\in\{1,\ldots,K\} \right \},
where \beta is the threshold chosen for the instability level.
4 Simulation experiments
In this section, we conduct a simulation study to evaluate the performance of the MGLasso method, both in terms of clustering and support recovery. Receiver Operating Characteristic (ROC) curves are used to evaluate the adequacy of the inferred graphs with the
4.1 Synthetic data models
We consider three different synthetic network models: the Stochastic Block Model (SBM, (Fienberg and Wasserman 1981)), the Erdös-Renyi model (Erdős, Rényi, et al. 1960) and the Scale-Free model (Newman, Strogatz, and Watts 2001). In each case, Gaussian data is generated by drawing n independent realizations of a multivariate Gaussian distribution \mathcal N(0, \mathbf{\Sigma}) where \mathbf{\Sigma} \in \mathbb{R}^{p \times p} and \mathbf{\Omega} = \mathbf{\Sigma} ^{-1}. The support of \mathbf{\Omega}, equivalent to the network adjacency matrix, is generated from the three different models. The difficulty level of the problem is controlled by varying the ratio \frac{n}{p} with p fixed at 40: \frac{n}{p}\in \{0.5,1,2\}.
4.1.1 Stochastic Block-Model
We construct a block-diagonal precision matrix \mathbf{\Omega} as follows. First, we generate the support of \mathbf{\Omega} as shown in Figure 1, denoted by \boldsymbol A\in\{0,1\}^{p\times p}. To do this, the variables are first partitioned into K = 5 hidden groups, noted C_1, \dots, C_K described by a latent random variable Z_i, such that Z_i = k if i = C_k. Z_i follows a multinomial distribution
P(Z_i = k) = \pi_k, \quad \forall k \in \{1, \dots, K\}, where \pi = (\pi_1, \dots, \pi_k) is the vector of proportions of clusters whose sum is equal to one. The set of latent variables is noted \mathbf{Z} = \{ Z_1, \dots, Z_K\}. Conditionally to \mathbf{Z}, A_{ij} follows a Bernoulli distribution such that
A_{ij}|Z_i = k, Z_j = l \sim \mathcal{B}(\alpha_{kl}), \quad \forall k,l \in \{1 \dots, K\},
where \alpha_{kl} is the probability of inter-cluster connectivity, with \alpha_{kl} = 0.01 if k\neq l and \alpha_{ll} = 0,75. For k\in\{1,\ldots, K\}, we define p_k = \sum_{i=1}^p \boldsymbol{1}_{\{Z_i = k\}}. The precision matrix \mathbf{\Omega} of the graph is then calculated as follows. We define \Omega_{ij} = 0 if Z_i\neq Z_j ; otherwise, we define \Omega_{ij} = A_{ij}\omega_{ij} where, for all i\in\{1,\ldots,p\} and for all j\in\{1,\ldots,p| Z_j = Z_i\}, \omega_{ij} is given by :
If \alpha_{ll} were to be equal to one, this construction of \mathbf{\Omega} would make it possible to control the level of correlation between the variables in each block to \rho. Introducing a more realistic scheme with \alpha_{ll}=0.75 allows only to have an approximate control.
Show the code
bloc_diag <- function(n_vars, connectivity_mat, prop_clusters, rho) {
true_nclusters <- 0
while (true_nclusters != length(prop_clusters)){ ## To make sure we have the required number of clusters
network <- simone::rNetwork(n_vars, connectivity_mat, prop_clusters)
true_nclusters <- length(unique(network$clusters))
}
precision_mat <- network$A[order(network$clusters), order(network$clusters)]
eff_clusters <- table(network$clusters)
b = -rho/(1+rho*(eff_clusters-2) -rho^2*(eff_clusters-1))
d = (1+rho*(eff_clusters-2))/(1+rho*(eff_clusters-2) -rho^2*(eff_clusters-1))
for (i in 1:length(eff_clusters)) {
temp <- precision_mat[which(row.names(precision_mat) == i), which(row.names(precision_mat) == i)]
temp <- as.matrix(temp)
temp[temp != 0] = b[i]
diag(temp) = d[i]
precision_mat[which(row.names(precision_mat) == i), which(row.names(precision_mat) == i)] = temp
}
flag <- min(svd(precision_mat)$d)
if(flag < 0){
diag(precision_mat) <- 1 - flag
}
return(precision_mat)
}
#' simulate data with given graph structure
sim_data <- function(p = 20,
np_ratio = 2,
structure = c("block_diagonal", "hub", "scale_free", "erdos"),
alpha,
prob_mat,
rho,
g,
inter_cluster_edge_prob = 0.01,
p_erdos = 0.1,
verbose = FALSE){
structure <- match.arg(structure)
n = round(np_ratio * p)
switch (structure,
erdos = {
network <- simone::rNetwork(p, p_erdos, 1)
correlation <- solve(network$Theta)
X <- mvtnorm::rmvnorm(n, sigma = correlation)
graph <- network$Theta
if(verbose) message("Data, precision matrix and graph generated from Erdos structure.")
},
hub = {
L = huge::huge.generator(graph = "hub", g = 5, d = p, n = n, verbose = FALSE)
X=L$data
graph = L$omega
correlation = L$sigma
if(verbose) message("Data, precision matrix and graph generated from Hub structure.")
},
scale_free = {
L = huge::huge.generator(graph = "scale-free", d = p, n = n, verbose = FALSE) ## d edges graph
X=L$data
graph = L$omega
correlation = L$sigma
if(verbose) message("Data, precision matrix and graph generated from Scale free structure.")
},
block_diagonal = {
if(inter_cluster_edge_prob != 0) { # inter-clusters edges added in the flipped way
flag <- TRUE
while(flag) {
K <- bloc_diag(p, prob_mat, alpha, rho)
target_indices <- which(K[upper.tri(K)] == 0) # Random selection of edges to be set to 1
select_len <- round(length(target_indices) * inter_cluster_edge_prob)
selected_indices <- sample(target_indices, select_len)
precision_level <- unique(K[upper.tri(K)])
precision_level <- max(precision_level[precision_level != 0])
K[upper.tri(K)][selected_indices] <- precision_level
K <- as.matrix(Matrix::forceSymmetric(K, uplo = "U"))
flag <- any(eigen(K)$values <= 0) # Control of positive definiteness
}
correlation <- solve(K)
graph = K
X <- mvtnorm::rmvnorm(n, sigma = correlation)
if(verbose) message("Data, precision matrix and graph generated from block-diagonal
structure with inter-clusters edges.")
}
else { # Only intra-cluster edges while approximately controlling correlation level
K <- bloc_diag(p, prob_mat, alpha, rho)
correlation <- solve(K)
graph = K
X <- mvtnorm::rmvnorm(n, sigma = correlation)
if(verbose) message("Data, precision matrix and graph generated from block-diagonal
structure with only intra-clusters edges.")
}
}
)
return(list(X=X, graph=graph, correlation=correlation))
}
adj_mat <- function(mat) {
diag(mat) <- 0
mat[ abs(mat) < 1e-10] <- 0
mat[mat != 0] <- 1
return(mat)
}
set.seed(2020)
sim_sbm <- sim_data(p = 40, structure = "block_diagonal",
alpha = rep(1/5, 5),
prob_mat = diag(0.75, 5),
rho = 0.2, inter_cluster_edge_prob = 0.01)
gsbm <- adj_mat(sim_sbm$graph)
Matrix::image(as(gsbm, "sparseMatrix"),
sub = "", xlab = "", ylab = "") 4.1.2 Erdös-Renyi Model
The Erdös-Renyi model is a special case of the stochastic block model where \alpha_{kl} = \alpha_{ll} = \alpha is constant. We set the density \alpha of the graph to 0.1; see Figure 2 for an example of the graph resulting from this model.
Show the code
set.seed(2022)
sim_erdos <- sim_data(p = 40, structure = "erdos", p_erdos = 0.1)
gerdos <- adj_mat(sim_erdos$graph)
Matrix::image(as(gerdos, "sparseMatrix"),
sub = "", xlab = "", ylab = "")4.1.3 Scale-free Model
The Scale-free Model generates networks whose degree distributions follow a power law. The graph starts with an initial chain graph of 2 nodes. Then, new nodes are added to the graph one by one. Each new node is connected to an existing node with a probability proportional to the degree of the existing node. We set the number of edges in the graph to 40. An example of scale-free graph is shown in Figure 3.
Show the code
set.seed(2022)
sim_sfree <- sim_data(p = 40, structure = "scale_free")
gsfree <- adj_mat(sim_sfree$graph)
Matrix::image(as(gsfree, "sparseMatrix"),
sub = "", xlab = "", ylab = "")4.2 Support recovery
We compare the network structure learning performance of our approach to that of GLasso in its neighborhood selection version using ROC curves. In both GLasso and MGLasso, the sparsity is controlled by a regularization parameter \lambda_1; however, MGLasso admits an additional regularization parameter, \lambda_2, which controls the strength of convex clustering. To compare the two methods, in each ROC curve, we vary the parameter \lambda_1 while the parameter \lambda_2 (for MGLasso) is kept constant. We computed ROC curves for 4 different penalty levels for the \lambda_2 parameter; since GLasso does not depend on \lambda_2, the GLasso ROC curves are replicated.
In a decision rule associated with a sparsity penalty level \lambda_1, we recall the definition of the two following functions. The sensitivity, also called the true positive rate or recall, is given by : \begin{align*} \lambda_1 &\mapsto \text{sensitivity}(\lambda_1) = \frac{TP(\lambda_1)}{TP(\lambda_1) + FN(\lambda_1)}. \end{align*} Specificity, also called true negative rate or selectivity, is defined as follows: \begin{align*} \lambda_1 &\mapsto \text{specificity}(\lambda_1) = \frac{TN(\lambda_1)}{TN(\lambda_1) + FP(\lambda_1)}. \end{align*} The ROC curve with the parameter \lambda_1 represents \text{sensitivity}(\lambda_1) as a function of 1 - \text{specificity}(\lambda_1) which is the false positive rate.
For each configuration (n, p fixed), we generate 50 replications and their associated ROC curves, which are then averaged. The average ROC curves for the three models are given in Figure 4, Figure 5 and Figure 6 by varying \frac{n}{p}\in \{0.5,1,2\}.
Show the code
path_checks <- "./rscripts/mglasso_functions/"
source(paste0(path_checks, "simulate.R"))
source(paste0(path_checks, "plot.R"))Show the code
# The code used to generate the following results is based on R and python libraries/functions.
# the reticulate package allows to compile python code in Rstudio
# First set up the python dependencies before running the code
library(reticulate)
config <- reticulate::py_config() # Initialize python engine
# It is possible to create an isolated virtual or conda environment in which the code will be compile
# by calling reticulate::virtualenv_create() or reticulate::conda_create() and then activate the environment with reticulate::use_condaenv() or reticulate::use_virtualenv().
system2(config$python, c("-m", "pip", "install",
shQuote("git+https://github.com/neurospin/pylearn-parsimony.git")))
# The pylean-parsimony require scipy version 1.7.1. More recent versions generate compilation errors.
reticulate::py_install(packages = c("scipy == 1.7.1",
"scikit-learn",
"numpy == 1.22.4",
"six", # If needed
"matplotlib"))
# To check if all required python dependencies are available.
testthat::expect_true(reticulate::py_module_available("numpy"))
testthat::expect_true(reticulate::py_module_available("scipy"))
testthat::expect_true(reticulate::py_module_available("six"))
testthat::expect_true(reticulate::py_module_available("matplotlib"))
testthat::expect_true("scikit-learn" %in% reticulate::py_list_packages()$package)
testthat::expect_true("pylearn-parsimony" %in% reticulate::py_list_packages()$package)
# It might be necessary to reload Rstudio on some operator systems.
source("./rscripts/mglasso_functions/onload.R")Show the code
# Performances calculation ------------------------------------------------
# Launched on a cluster using 72 cores
## Settings ----------------------------------------------------------------
### Model -------------------------------------------------------------------
# NA values for some parameter mean they are not relevant
p <- 40
seq_n <- c(20, 40, 80)
seq_rho <- 0.95
seq_dnsty <- NA
type <- NA
alpha <- rep(1/5, 5)
ngroup <- length(alpha)
pi <- diag(0.75, ngroup)
### Simulation --------------------------------------------------------------
n_simu <- 50
list_ii_rho <- configs_simu(n_simu, seq_rho, seq_dnsty, seq_n, type)
no_cores <- min(72, length(list_ii_rho))
### Erdos -------------------------------------------------------------------
runtime_roc_config_p40_erdos01 <- system.time(
roc_config_p40_erdos01 <- mclapply(
list_ii_rho,
FUN = one_simu_ROC,
model = "erdos",
mc.cores = no_cores)
)
save(roc_config_p40_erdos01,
file = paste0(path_roc, "roc_config_p40_erdos01.RData"))
save(runtime_roc_config_p40_erdos01,
file = paste0(path_roc, "runtime_roc_config_p40_erdos01.RData"))
## Erdos
load(paste0(path_roc, "roc_config_p40_erdos01.RData"))
dt_full <- roc_config_p40_erdos01
### Merge in one graph
# Three sample sizes are used and the vector c(20,40,80) is replicated 50 times
# I subset the dataframe in three parts corresponding to the relevant sample sizes
index <- seq(1, 150, by = 3)
roc_dt20 <- dt_full[index]
index <- seq(2, 150, by = 3)
roc_dt40 <- dt_full[index]
index <- seq(3, 150, by = 3)
roc_dt80 <- dt_full[index]
# Here we compute the mean over the 50 ROC curves
roc_dt20 <- get_mean_ROC_stat(roc_dt20)
roc_dt40 <- get_mean_ROC_stat(roc_dt40)
roc_dt80 <- get_mean_ROC_stat(roc_dt80)
# I restructure the list result in a matrix for plot
roc_dt20 <- reformat_roc_res_for_ggplot(roc_dt20)
roc_dt20$np <- 0.5 # add a ratio n over p variable
roc_dt40 <- reformat_roc_res_for_ggplot(roc_dt40)
roc_dt40$np <- 1
roc_dt80 <- reformat_roc_res_for_ggplot(roc_dt80)
roc_dt80$np <- 2
roc_dtf_erdos <- rbind(roc_dt20, roc_dt40, roc_dt80)Show the code
load("./data/roc_dtf_erdos.RData")
ggplot(roc_dtf_erdos, aes(x = fpr,
y = tpr,
color = method )) +
geom_line(size = 0.7) +
facet_grid(np ~ tv) +
geom_abline(intercept = 0, slope = 1, linetype = "dashed", color = "grey") +
xlab("False Positive Rate") +
ylab("True Positive Rate") +
ggtitle("") +
scale_colour_manual(values = ghibli_palette("MarnieMedium1")[5:6])Show the code
# Performances calculation ------------------------------------------------
# Launched on a cluster using 72 cores
## Settings ----------------------------------------------------------------
### Model -------------------------------------------------------------------
# NA values for some parameter mean they are not relevant
p <- 40
seq_n <- c(20, 40, 80)
seq_rho <- 0.95
seq_dnsty <- NA
type <- NA
alpha <- rep(1/5, 5)
ngroup <- length(alpha)
pi <- diag(0.75, ngroup)
### Simulation --------------------------------------------------------------
n_simu <- 50
list_ii_rho <- configs_simu(n_simu, seq_rho, seq_dnsty, seq_n, type)
no_cores <- min(72, length(list_ii_rho))
### Scale-Free --------------------------------------------------------------
runtime_roc_config_p40_scalefree <- system.time(
roc_config_p40_scalefree <- mclapply(
list_ii_rho,
FUN = one_simu_ROC,
model = "scale_free",
mc.cores = no_cores)
)
save(roc_config_p40_scalefree,
file = paste0(path_roc, "roc_config_p40_scalefree.RData"))
save(runtime_roc_config_p40_scalefree,
file = paste0(path_roc, "runtime_roc_config_p40_scalefree.RData"))
## Scale-free
load(paste0(path_roc, "roc_config_p40_scalefree.RData"))
dt_full <- roc_config_p40_scalefree
### Merge in one graph
index <- seq(1, 150, by = 3)
roc_dt20 <- dt_full[index]
index <- seq(2, 150, by = 3)
roc_dt40 <- dt_full[index]
index <- seq(3, 150, by = 3)
roc_dt80 <- dt_full[index]
roc_dt20 <- get_mean_ROC_stat(roc_dt20)
roc_dt40 <- get_mean_ROC_stat(roc_dt40)
roc_dt80 <- get_mean_ROC_stat(roc_dt80)
roc_dt20 <- reformat_roc_res_for_ggplot(roc_dt20)
roc_dt20$np <- 0.5
roc_dt40 <- reformat_roc_res_for_ggplot(roc_dt40)
roc_dt40$np <- 1
roc_dt80 <- reformat_roc_res_for_ggplot(roc_dt80)
roc_dt80$np <- 2
roc_dtf_sfree <- rbind(roc_dt20, roc_dt40, roc_dt80)
### Save
save(roc_dtf_sfree,
file = paste0(path_roc, "roc_dtf_sfree.RData"))Show the code
load("./data/roc_dtf_sfree.RData")
ggplot(roc_dtf_sfree, aes(x = fpr,
y = tpr,
color = method )) +
geom_line() +
facet_grid(np ~ tv) +
geom_abline(intercept = 0, slope = 1, linetype = "dashed", color = "grey") +
xlab("False Positive Rate") +
ylab("True Positive Rate") +
ggtitle("") +
scale_colour_manual(values = ghibli_palette("MarnieMedium1")[5:6])Show the code
# Launched on a cluster using 72 cores
## Settings ----------------------------------------------------------------
### Model -------------------------------------------------------------------
# NA values for some parameter mean they are not relevant
p <- 40
seq_n <- c(20, 40, 80)
seq_rho <- 0.95
seq_dnsty <- NA
type <- NA
alpha <- rep(1/5, 5)
ngroup <- length(alpha)
pi <- diag(0.75, ngroup)
### Simulation --------------------------------------------------------------
n_simu <- 50
list_ii_rho <- configs_simu(n_simu, seq_rho, seq_dnsty, seq_n, type)
no_cores <- min(72, length(list_ii_rho))
### Stochastic Block Diagonal -----------------------------------------------
runtime_roc_config_p40_bdiagflip001 <- system.time(
roc_config_p40_bdiagflip001 <- mclapply(
list_ii_rho,
FUN = one_simu_ROC,
model = "block_diagonal",
mc.cores = no_cores)
)
save(roc_config_p40_bdiagflip001,
file = paste0(path_roc, "roc_config_p40_bdiagflip001.RData"))
save(runtime_roc_config_p40_bdiagflip001,
file = paste0(path_roc, "runtime_roc_config_p40_bdiagflip001.RData"))
load(paste0(path_roc, "roc_config_p40_bdiagflip001.RData"))
dt_full <- roc_config_p40_bdiagflip001
### Merge in one graph
index <- seq(1, 150, by = 3)
roc_dt20 <- dt_full[index]
index <- seq(2, 150, by = 3)
roc_dt40 <- dt_full[index]
index <- seq(3, 150, by = 3)
roc_dt80 <- dt_full[index]
roc_dt20 <- get_mean_ROC_stat(roc_dt20)
roc_dt40 <- get_mean_ROC_stat(roc_dt40)
roc_dt80 <- get_mean_ROC_stat(roc_dt80)
roc_dt20 <- reformat_roc_res_for_ggplot(roc_dt20)
roc_dt20$np <- 0.5
roc_dt40 <- reformat_roc_res_for_ggplot(roc_dt40)
roc_dt40$np <- 1
roc_dt80 <- reformat_roc_res_for_ggplot(roc_dt80)
roc_dt80$np <- 2
roc_dtf_sbm <- rbind(roc_dt20, roc_dt40, roc_dt80)
save(roc_dtf_sbm,
file = paste0(path_roc, "roc_dtf_sbm.RData"))Show the code
load("./data/roc_dtf_sbm.RData")
ggplot(roc_dtf_sbm, aes(x = fpr,
y = tpr,
color = method )) +
geom_line() +
facet_grid(np ~ tv) +
geom_abline(intercept = 0, slope = 1, linetype = "dashed", color = "grey") +
xlab("False Positive Rate") +
ylab("True Positive Rate") +
ggtitle("") +
scale_colour_manual(values = ghibli_palette("MarnieMedium1")[5:6])Based on these empirical results, we first observe that, in all the considered simulation models, MGLasso improves over GLasso in terms of support recovery in the high-dimensional setting where p<n. In addition, in the absence of a total variation penalty, i.e., \lambda_2 = 0, MGLasso performs no worse than GLasso in each of the 3 models. However, for \lambda_2>0, increasing penalty value does not seem to significantly improve the support recovery performances for the MGLasso, as we observe similar results for \lambda_2=3.3,6.6,10. Preliminary analyses show that, as \lambda_2 increases, the estimates of the regression vectors are shrunk towards 0. This shrinkage effect of group-fused penalty terms was also observed in (Chu et al. 2021).
4.3 Clustering
In order to obtain clustering performance, we compared the partitions estimated by MGLasso, Hierarchical Agglomerative Clustering (HAC) with Ward’s distance and K-means to the true partition in a stochastic block model framework. Euclidean distances between variables are used for HAC and K-means. The criterion used for the comparison is the adjusted Rand index. We studied the influence of the correlation level inside clusters on the clustering performances through two different parameters: \rho \in \{ 0.1, 0.3 \}; the vector of cluster proportions is fixed at \mathbf \pi = (1/5, \dots, 1/5). We then simulate 100 Gaussian data sets for each simulation configuration (\rho, n/p fixed).The optimal sparsity penalty for MGLasso is chosen by the Stability Approach to Regularization Selection (StARS) method (Liu, Roeder, and Wasserman 2010), and we vary the parameter \lambda_2.
Show the code
# Launch simulations ------------------------------------------------------
## Settings ----------------------------------------------------------------
### Model -------------------------------------------------------------------
p <- 40
seq_n <- c(20, 40, 80)
alpha <- rep(1/5, 5)
seq_rho <- c(0.25, 0.95)
seq_dnsty <- c(0.75)
type <- NA #1 ## unused to do: delete in configs_simu parameters
ngroup <- length(alpha)
pi <- diag(0.75, ngroup)
### Simulation --------------------------------------------------------------
n_simu <- 100
list_ii_rho <- configs_simu(n_simu, seq_rho, seq_dnsty, seq_n, type)
mc_cores <- min(80, length(list_ii_rho))
RNGkind("L'Ecuyer-CMRG")
## Test --------------------------------------------------------------------
# For a quicker test:
# #set nl2 to 2 in one_simu_extended
# set p = 9 & n = 10
test <- one_simu_extended(list_ii_rho$`1`, verbose = TRUE, model = "block_diagonal")
## Models ------------------------------------------------------------------
# After the quicker test:
# reset nl2 to 20
### Stochastic Block Diagonal -----------------------------------------------
runtime_rand50_config_p40_bdiagflip001_allcor <- system.time(
rand50_config_p40_bdiagflip001_allcor <- mclapply(
list_ii_rho,
FUN = one_simu_extended,
model = "block_diagonal",
mc.cores = mc_cores)
)
save(runtime_rand50_config_p40_bdiagflip001_allcor,
file = paste0(path_extended, "runtime_rand100_config_p40_bdiagflip001_allcor.RData"))
save(rand50_config_p40_bdiagflip001_allcor,
file = paste0(path_extended, "rand100_config_p40_bdiagflip001_allcor.RData"))Show the code
# Clustering
# Settings:
# - Inter-clusters edge probability $0.01$ (flip on all the missing edges)
## Rand index
load(paste0(path_extended, "rand100_config_p40_bdiagflip001_allcor.RData"))
dt <- rand100_config_p40_bdiagflip001_allcor
# Calculate clusters partitions with thresh_fuse as the required difference threshold for merging two regression vectors
list_res <- lapply(dt, function(e){get_perf_from_raw("rand", e, thresh_fuse = 1e-6)})
dt_rand <- do.call(rbind, list_res)
save(dt_rand,
file = paste0(path_extended, "dt_rand_p40_bdiagflip001_allcor_thresh_e6.RData"))
## Theoritical correlation set to $0.25$
plot_res(dt_rand, crit_ = "rand", ncluster_ = c(5, 10, 15, 20), cor_ = 0.25, np_ = c(0.5, 1, 2))
## Theoritical correlation set to $0.95$
plot_res(dt_rand, crit_ = "rand", ncluster_ = c(5, 10, 15, 20), cor_ = 0.95, np_ = c(0.5, 1, 2))
# The files `rand_dt_higher_cor_sbm.RData` and `rand_dt_lower_cor_sbm.RData` are obtained from splitting `dt_rand`according to theoritical correlation levels.Show the code
load("./data/rand_dt_lower_cor_sbm.RData")
plot_res(dt_rand, crit_ = "rand", ncluster_ = c(5, 10, 15, 20), cor_ = 0.25, np_ = c(0.5, 1, 2),
main = "")Show the code
load("./data/rand_dt_higher_cor_sbm.RData")
plot_res(dt_rand, crit_ = "rand", ncluster_ = c(5, 10, 15, 20), cor_ = 0.95, np_ = c(0.5, 1, 2),
main = "")The results shown in Figure 7 and Figure 8 demonstrate that, particularly at the lower to medium levels of the hierarchy (between 20 and 10 clusters), the hierarchical clustering structure uncovered by MGLasso is comparable to popular clustering methods used in practice. For the higher levels (5 clusters), the performances of MGLasso deteriorate. As expected, the three compared methods also deteriorate as the level of correlation inside clusters decreases.
5 Analysis of microbial associations data
We finally illustrate our new method of inferring the multiscale Gaussian graphical model, with an application to the analysis of microbial associations in the American Gut Project. The data used are count data that have been previously normalized by applying the log-centered ratio technique as used in (Kurtz 2015). After some filtering steps (Kurtz 2015) on the operational taxonomic units OTUs counts (removed if present in less than 37\% of the samples) and the samples (removed if sequencing depth below 2700), the top OTUs are grouped in a dataset composed of n_1 = 289 for 127 OTUs. As a preliminary analysis, we perform a hierarchical agglomerative clustering (HAC) on the OTUs, which allows us to identify four significant groups. The correlation matrix of the dataset is given in Figure 9; variables have been rearranged according to the HAC partition.
Show the code
data(amgut1.filt, package = "SpiecEasi")
dta <- amgut1.filt
dta <- t(SpiecEasi::clr(dta + 1 , 1))
S <- cor(dta)
hclust_dta <- hclust(dist(t(dta)), method = "ward.D")
hclust_dta <- hclust(as.dist(1-S^2), method = "ward.D")
cut_dta <- stats::cutree(hclust_dta, 4)
Matrix::image(as(S[order(cut_dta), order(cut_dta)],
"sparseMatrix"),
sub = "", xlab = "", ylab = "",
colorkey = FALSE)The average correlations within blocks of variables belonging to the same cluster are given below. We observe relatively high levels of correlation in small blocks, similar to the simulation models used to evaluate the performance of clustering in the Simulation Experiments section.
Show the code
C <- cor(dta)
diag(C) <- 0
clusters <- cut_dta
seq_p <- 1:length(clusters)
L <- split(seq_p, factor(clusters))
mat <- t(sapply(L,
FUN = function(v) {
summary(as.vector(C[v,v]))
}))
out <- data.frame(Cluster = 1:4, "Mean correlation" = round(mat[, "Mean"], 3))
knitr::kable(out)| Cluster | Mean.correlation |
|---|---|
| 1 | 0.013 |
| 2 | 0.815 |
| 3 | 0.555 |
| 4 | 0.566 |
We apply MGLasso to the normalised counts to infer a graph and a clustering structure. The graphs obtained by MGLasso for \lambda_2 = 0 and for \lambda_2 = 5 (corresponding respectively 127 and 80 clusters) are given below. In each case, the parameter \lambda_1 is chosen by stability selection (see Model Selection section).
Show the code
knitr::include_graphics(c("figures/mglasso_tv0.png","figures/mglasso_full_graph_tv5.png"))The variables were reordered according to the clustering partition obtained from the distances between the regression vectors. Increasing \lambda_2 reduces the number of clusters and leads to a shrinking effect on the estimates. The adjacency matrix of the neighborhood selection equivalent to setting \lambda_2 to 0 is given in Figure 10 (up). In Figure 10 (down), the deduced partition is composed of 80 clusters.